BACKGROUND OF THE INVENTION

1. Field of the Invention 

The present invention relates to actor-critic fuzzy reinforcement learning 
(ACFRL), and particularly to a system controlled by a convergent ACFRL 
methodology. 



2. Discussion of the Related Art 

Reinforcement learning techniques provide powerful methodologies for 
learning through interactions with the environment. Earlier, in ARIC (Berenji, 
1992) and GARIC (Berenji and Khedkar, 1992), fuzzy set theory was used to 
generalize the experience obtained through reinforcement learning between 
similar states of the environment. In recent years, we have extended Fuzzy 
Reinforcement Learning (FRL) for use in a team of heterogeneous intelligent 
agents who collaborate with each other (Berenji and Vengerov, 1999, 2000). 
It is desired to have a fuzzy system that is tunable or capable of learning 
from experience, such that as it learns, its actions, which are based on the 
content of its tunable fuzzy rulebase, approach an optimal policy. 

The use of policy gradient in reinforcement learning was first introduced
by Williams (1992) in his actor-only REINFORCE algorithm. The algorithm
finds an unbiased estimate of the gradient without the assistance of a learned
value function. As a result, REINFORCE learns much more slowly than RL
methods relying on the value function, and has received relatively little
attention. Recently, Baxter and Bartlett (2000) extended the REINFORCE
algorithm to partially observable Markov decision processes (POMDPs).
However, learning a value function and using it to reduce the variance of the
gradient estimate appears to be the key to successful practical applications
of reinforcement learning.

The closest theoretical result to this invention is the one by Sutton et
al. (2000). That work derives exactly the same expression for the policy
gradient with function approximation as the one used by Konda and
Tsitsiklis. However, the parameter updating algorithm proposed by Sutton et al.
based on this expression is not practical: it requires estimation of the steady 
state probabilities under the policy corresponding to each iteration of the 
algorithm as well as finding a solution to a nonlinear programming problem 
for determining the new values of the actor's parameters. Another similar 
result is the VAPS family of methods by Baird and Moore (1999). However, 
VAPS methods optimize a measure combining the policy performance with 
accuracy of the value function approximation. As a result, VAPS methods 
converge to a locally optimal policy only when no weight is put on value 
function accuracy, in which case VAPS degenerates to actor-only methods. 

ACTOR-CRITIC ALGORITHMS FOR REINFORCEMENT LEARNING

Actor-critic (AC) methods were among the first reinforcement learning
algorithms to use temporal-difference learning. These methods were first
studied in the context of a classical conditioning model in animal learning by
Sutton and Barto (1981). Later, Barto, Sutton and Anderson (1983) successfully
applied AC methods to the cart-pole balancing problem, where they defined
for the first time the terms actor and critic.

In the simplest case of finite state and action spaces, the following AC
algorithm has been suggested by Sutton and Barto (1998). After choosing
the action $a_t$ in the state $s_t$ and receiving the reward $r_t$, the critic
evaluates the new state and computes the TD error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \tag{1}$$

where $\gamma$ is the discounting rate and $V$ is the current value function
implemented by the critic. After that, the critic updates its value function,
which in the case of TD(0) becomes:

$$V(s_t) \leftarrow V(s_t) + \alpha_t \delta_t, \tag{2}$$

where $\alpha_t$ is the critic's learning rate at time $t$. The key step in this
algorithm is the update of the actor's parameters. If the TD error is positive,
the probability of selecting $a_t$ in the state $s_t$ in the future should be
increased, since the gain in state value outweighs the possible loss in the
immediate reward. By reverse logic, the probability of selecting $a_t$ in the
state $s_t$ in the future should be decreased if the TD error is negative.
Suppose the actor chooses actions stochastically using the Gibbs softmax
method:

$$\Pr\{a_t = a \mid s_t = s\} = \frac{e^{\theta(s,a)}}{\sum_{b} e^{\theta(s,b)}}, \tag{3}$$

where $\theta(s, a)$ is the value of the actor's parameter indicating the
tendency of choosing action $a$ in state $s$. Then, these parameters are
updated as follows:

$$\theta(s_t, a_t) \leftarrow \theta(s_t, a_t) + \beta_t \delta_t, \tag{4}$$

where $\beta_t$ is the actor's learning rate at time $t$.
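The updates in equations (1), (2) and (4), together with the softmax rule of
equation (3), can be sketched for a small tabular problem as follows. This is
only an illustrative sketch: the two-state, two-action setup, the function
names and the numeric step sizes are assumptions and do not come from the
cited works.

```python
import math

def softmax_probs(theta_s):
    """Gibbs softmax over the action preferences of one state, equation (3)."""
    m = max(theta_s)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in theta_s]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(V, theta, s, a, r, s_next, gamma, alpha, beta):
    """Apply equations (1), (2) and (4) in place for one observed transition."""
    delta = r + gamma * V[s_next] - V[s]  # TD error, equation (1)
    V[s] += alpha * delta                 # critic update, equation (2)
    theta[s][a] += beta * delta           # actor update, equation (4)
    return delta

# Hypothetical example: two states, two actions, everything initialized to zero.
V = [0.0, 0.0]
theta = [[0.0, 0.0], [0.0, 0.0]]
delta = actor_critic_step(V, theta, s=0, a=1, r=1.0, s_next=1,
                          gamma=0.9, alpha=0.5, beta=0.1)
probs = softmax_probs(theta[0])  # action 1 is now preferred in state 0
```

A positive TD error raises the preference of the action just taken, so its
softmax probability in that state increases.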

The convergence properties of the above AC algorithm have not been
studied thoroughly. Interest in AC algorithms subsided when Watkins
(1989) introduced the Q-learning algorithm and proved its convergence in
finite state and action spaces. For almost a decade, Q-learning has served the
field of reinforcement learning (RL) well. However, as it is desired to apply RL
algorithms to more complex problems, it is recognized herein that there are
limitations to Q-learning in this regard.

First of all, as the size of the state space becomes large or infinite (as
is the case for continuous state problems), function approximation
architectures have to be employed to generalize Q-values across all states. The
updating rule for the parameter vector $\theta$ of the approximation
architecture then becomes:

$$\theta_{t+1} \leftarrow \theta_t + \alpha_t \nabla_{\theta_t} Q(s_t, a_t, \theta_t)\big(r_t + \gamma \max_a Q(s_{t+1}, a, \theta_t) - Q(s_t, a_t, \theta_t)\big). \tag{5}$$

Even though Q-learning with function approximation as presented in
equation (5) is currently widely used in reinforcement learning, it has no
convergence guarantees and can diverge even for linear approximation
architectures (Bertsekas and Tsitsiklis, 1996). A more serious limitation of
the general Q-learning equation presented above becomes apparent when the
size of the action space becomes large or infinite. In this case, a nonlinear
programming problem needs to be solved at every time step to evaluate
$\max_a Q(s_{t+1}, a, \theta_t)$, which can seriously limit applications of
this algorithm to real-time control problems.
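As a rough illustration of equation (5), the following sketch applies one
Q-learning update for a linear approximation architecture, where the gradient
of Q with respect to the parameters is simply the feature vector. The feature
vectors, the function names and the small enumerated action set are
assumptions made for the example; with a continuous action space the `max`
below would turn into the optimization problem discussed above.

```python
def q_value(theta, features):
    """Linear approximation: Q(s, a, theta) = theta . phi(s, a)."""
    return sum(t * f for t, f in zip(theta, features))

def q_learning_update(theta, phi_sa, phi_next_all, r, gamma, alpha):
    """One step of equation (5) for a linear architecture.

    phi_sa       -- feature vector of the visited pair (s_t, a_t)
    phi_next_all -- feature vectors of (s_{t+1}, a) for every candidate action a
    """
    # Enumerating actions is only feasible for a small finite action set.
    best_next = max(q_value(theta, phi) for phi in phi_next_all)
    td_error = r + gamma * best_next - q_value(theta, phi_sa)
    # For a linear Q, the gradient with respect to theta equals phi_sa.
    return [t + alpha * td_error * f for t, f in zip(theta, phi_sa)]

# Hypothetical two-parameter example.
new_theta = q_learning_update(theta=[0.0, 0.0], phi_sa=[1.0, 0.0],
                              phi_next_all=[[0.0, 1.0], [1.0, 1.0]],
                              r=1.0, gamma=0.9, alpha=0.5)
```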

The Q-learning algorithm, as well as most other RL algorithms, can be 
classified as a critic-only method. Such algorithms approximate the value 
function and usually choose actions in an ε-greedy fashion with respect to
the value function. At the opposite side of the spectrum of RL algorithms 
are actor-only methods (e.g. Williams, 1988; Jaakkola, Singh, and Jordan, 
1995). In these methods, the gradient of the performance with respect to the 
actor's parameters is directly estimated by simulation, and the parameters 
are updated in the direction of the gradient improvement. The drawback 
of such methods is that gradient estimators may have a very large variance, 
leading to a slow convergence. 

It is recognized herein that actor-critic algorithms combine the best
features of critic-only and actor-only methods: the presence of an actor allows
direct computation of actions without having to solve a nonlinear programming
problem, while the presence of a critic allows fast estimation of the
performance gradient. The critic's contribution to the speed of gradient
estimation is easily demonstrated by considering the reward used in computing
the performance gradient in episode-based learning. In actor-only methods,

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{T-t} r_T, \tag{6}$$

while using the critic's estimate of the value function,

$$\hat{R}_t = r_t + \gamma V_t(s_{t+1}). \tag{7}$$

If $\{r_t : t > 0\}$ are independent random variables with a fixed variance,
then $R_t$ clearly has a larger variance than $\hat{R}_t$, which leads to a slow
stochastic convergence. However, $\hat{R}_t$ initially has a larger bias because
$V_t(s_{t+1})$ is an imperfect estimate of the true value function during the
training phase.
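The two reward targets in equations (6) and (7) can be compared with a short
sketch. The function names and the numbers are illustrative assumptions; the
point is only that the full return sums every remaining (noisy) reward, while
the bootstrapped target uses a single reward plus the critic's estimate.

```python
def monte_carlo_return(rewards, gamma):
    """Equation (6): discounted sum of all rewards from time t to time T."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def bootstrapped_return(r_t, v_next, gamma):
    """Equation (7): one observed reward plus the critic's value estimate."""
    return r_t + gamma * v_next

# Hypothetical episode tail with three rewards of 1 and gamma = 0.5.
full = monte_carlo_return([1.0, 1.0, 1.0], gamma=0.5)
# If the critic's estimate of the next state were exact (1 + 0.5 * 1 = 1.5),
# both targets would agree.
boot = bootstrapped_return(1.0, v_next=1.5, gamma=0.5)
```

When the critic's estimate is wrong, the second target is biased, which is the
trade-off described in the text.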

CONVERGENT ACTOR-CRITIC ALGORITHM 

Recently, Konda and Tsitsiklis (2000) presented a simulation-based AC
algorithm and proved convergence of the actor's parameters to a local optimum
for a very large range of function approximation techniques. An actor in
their algorithm can be any function that is parameterized by a linearly
independent set of parameters, that is twice differentiable with respect to
these parameters, and that selects every action with a non-zero probability.
They also suggested that the algorithm will still converge for continuous
state-action spaces if certain ergodicity assumptions are satisfied.

Konda and Tsitsiklis proposed two varieties of their algorithm,
corresponding to a TD($\lambda$) critic for $0 \le \lambda < 1$ and a TD(1)
critic. In both variants, the critic is a linearly parameterized approximation
architecture for the Q-function:

$$Q_p^{\theta}(s, a) = \sum_{i=1}^{n} p^i \frac{\partial}{\partial \theta^i} \ln \pi_\theta(s, a), \tag{8}$$

where $p = (p^1, \ldots, p^n)$ denotes the parameter vector of the critic,
$\theta = (\theta^1, \ldots, \theta^n)$ denotes the parameter vector of the
actor, and $\pi_\theta(s, a)$ denotes the probability of taking action $a$ when
the state $s$ is encountered, under the policy corresponding to $\theta$. Notice
that the critic has as many free parameters as the actor, and the basis
functions of the critic are completely specified by the form of the actor.
Therefore, only one independent function approximation architecture needs to
be specified by the modeler.

In problems where no well-defined episodes exist, the critic also stores
$\rho$, the estimate of the average reward under the current policy, which is
updated according to:

$$\rho_{t+1} = \rho_t + \alpha_t (r_t - \rho_t). \tag{9}$$

The critic's parameter vector $p$ is updated as follows:

$$p_{t+1} = p_t + \alpha_t \big(r_t - \rho_t + Q_{p_t}^{\theta_t}(s_{t+1}, a_{t+1}) - Q_{p_t}^{\theta_t}(s_t, a_t)\big) z_t, \tag{10}$$

where $\alpha_t$ is the critic's learning rate at time $t$ and $z_t$ is an
$n$-vector representing the eligibility trace. In problems with well-defined
episodes, the average cost term $\rho$ is not necessary and can be removed from
the above equations. The TD(1) critic updates $z_t$ according to:

$$z_{t+1} = \begin{cases} z_t + \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1}), & s_t \ne s_0 \text{ or } a_t \ne a_0, \\ \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1}), & \text{otherwise,} \end{cases}$$

while the TD($\lambda$) critic updates $z_t$ according to:

$$z_{t+1} = \lambda z_t + \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1}). \tag{11}$$

It is recognized in the present invention that the update equation for the
actor's parameters may be simplified from that presented by Konda and
Tsitsiklis (2000) by restricting $\theta$ to be bounded. In practice, this does
not reduce the power of the algorithm since the optimal parameter values are
finite in well-designed actors.

The resulting update equation is:

$$\theta_{t+1} = \Gamma\big(\theta_t - \beta_t Q_{p_t}^{\theta_t}(s_{t+1}, a_{t+1}) \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1})\big), \tag{12}$$

where $\beta_t$ is the actor's learning rate at time $t$ and $\Gamma$ stands for
projection on a bounded rectangular set $\Theta \subset \mathbb{R}^n$
(truncation).

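The above algorithm converges if the learning rate sequences $\{\alpha_t\}$,
$\{\beta_t\}$ are positive, nonincreasing, and satisfy
$\beta_t / \alpha_t \to 0$ as well as

$$\delta_t > 0 \text{ for } t > 0, \qquad \sum_t \delta_t = \infty, \qquad \sum_t \delta_t^2 < \infty,$$

where $\delta_t$ stands for either $\alpha_t$ or $\beta_t$.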
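One family of step sizes that meets these conditions is a pair of power laws
with exponents in (0.5, 1], the actor's exponent being the larger one. The
sketch below uses the exponents 0.6 and 0.9 purely as an assumed example, and
also shows a componentwise truncation of the kind the projection in equation
(12) performs on a rectangular set; the names and the bound are hypothetical.

```python
def alpha(t):
    """Critic step size: the sum diverges, the sum of squares converges."""
    return (t + 1) ** -0.6

def beta(t):
    """Actor step size: decays faster, so beta(t) / alpha(t) tends to 0."""
    return (t + 1) ** -0.9

def project(theta, bound):
    """Truncate each component of theta to the interval [-bound, bound]."""
    return [max(-bound, min(bound, x)) for x in theta]

ratio_late = beta(10**6) / alpha(10**6)    # equals (10**6 + 1) ** -0.3
clipped = project([5.0, -7.0, 0.5], bound=2.0)
```

Because the actor's step size vanishes relative to the critic's, the critic
effectively sees a slowly changing policy.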

It is desired to have an actor-critic algorithm for use in controlling a 
system with fuzzy reinforcement learning that converges to at least approxi- 
mately a locally optimal policy. 

SUMMARY OF THE INVENTION 

In view of the above, a system is provided that is controlled by an
actor-critic based fuzzy reinforcement learning algorithm that provides
instructions to a processor of the system for applying actor-critic based fuzzy
reinforcement learning. The system includes a database of fuzzy-logic rules for
mapping input data to output commands for modifying a system state, and a
reinforcement learning algorithm for updating the fuzzy-logic rules database
based on effects on the system state of the output commands mapped from
the input data. The reinforcement learning algorithm is configured to
converge at least one parameter of the system state to at least approximately
an optimum value following multiple mapping and updating iterations. A
software program and method are also provided.

In a preferred embodiment, the reinforcement learning algorithm may 
be based on an update equation including a derivative with respect to said 
at least one parameter of a logarithm of a probability function for taking 
a selected action when a selected state is encountered. The reinforcement 
learning algorithm may be configured to update the at least one parameter 
based on said update equation. The system may include a wireless transmit- 
ter, a wireless network, an electro-mechanical system, a financial system such 
as for portfolio management, pricing of derivative securities, granting loans 
and/or determining credit worthiness, or insurance, medical systems such as 
for determining usefulness of new drug therapies, text and data mining such 
as for web engines or web caching, and/or biologically-inspired robotics.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figures 1 and 2 illustrate operations in a preferred method for controlling
a system with convergent actor-critic fuzzy reinforcement learning.

Figures 3 and 4 illustrate backlog and interference fuzzy labels used by
agents in the 6 fuzzy rules presented in the "SOLUTION METHODOLOGY"
section.

INCORPORATION BY REFERENCE



What follows is a list of cited references, each of which is, in addition to
those references cited above and below, and including that which is described 
as background and the summary of the invention, hereby incorporated by 
reference into the detailed description of the preferred embodiments below, 
as disclosing alternative embodiments of elements or features of the preferred 
embodiments not otherwise set forth in detail below. A single one or a 
combination of two or more of these references may be consulted to obtain a 
variation of the preferred embodiments described in the detailed description 
below. Further patent, patent application and non-patent references are 
cited in the written description and are also incorporated by reference into 
the detailed description of the preferred embodiment with the same effect as 
just described with respect to the following references: 

Baird, L. C., and Moore, A. W. (1999) "Gradient descent for general
Systems 11.

Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983) "Neuronlike
elements that can solve difficult learning control problems," IEEE Transactions

Berenji, H. R. and Khedkar, P. (1992) "Learning and tuning fuzzy logic
controllers through reinforcements," IEEE Transactions on Neural Networks,
vol. 3, no. 5, pp. 724-740.

Berenji, H. R. and Vengerov, D. (1999) "Cooperation and coordination
between fuzzy reinforcement learning agents in continuous state partially
tional Conference on Fuzzy Systems (FUZZ-IEEE '99), pp. 621-627.

Berenji, H. R. and Vengerov, D. (2000) "Advantages of cooperation
between reinforcement learning agents in difficult stochastic problems,"
Proceedings of the 9th IEEE International Conference on Fuzzy Systems
(FUZZ-IEEE 2000), pp. 871-876.

Hanly, S. V. and Tse, D. N. (1999) "Power control and capacity of spread
spectrum wireless networks," Automatica, vol. 35, no. 12, pp. 1987-2012.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996) Neuro-Dynamic Programming,
Athena Scientific.

vances in Neural Information Processing Systems, Vol. 12.

Kosko, B. (1992) "Fuzzy systems as universal approximators," IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE '92), pp. 1153-1162.

Sugeno, M., Kang, G. T. (1988) "Structure identification of fuzzy model,"
170.

duction. MIT Press.

Telecommunications Industry Association, July 1995.

Wang, L.-X. (1992) "Fuzzy systems are universal approximators," IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE '92), pp. 1163-1169.

Watkins, C. J. H. (1989) Learning from delayed rewards, Ph.D. thesis,
Cambridge University.

tionst systems." Technical Report NU-CCS-88-3, Northeastern University,
College of Computer Science.

DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

What follows is a preferred embodiment of a system and method for
convergent fuzzy reinforcement learning (FRL), as well as experimental results
that illustrate advantageous features of the system. Below is provided a
preferred fuzzy rulebase actor that satisfies conditions that guarantee the
convergence of its parameters to a local optimum. The preferred fuzzy
rulebase uses TSK rules, Gaussian membership functions and product inference.
As an application domain, power control in wireless transmitters,
characterized by delayed rewards and a high degree of stochasticity, is
illustrated as an example of a system that may be enhanced with the convergent
actor-critic FRL (ACFRL).

As mentioned above, many further applications may be enhanced through
incorporation of the control system of the preferred embodiment, such as
wireless networks, electro-mechanical systems, financial systems such as for
portfolio management, pricing of derivative securities, granting loans and/or
determining credit worthiness, or insurance systems, medical systems such
as for determining usefulness of new drug therapies, text and data mining
systems such as for web engines or web caching, image processing systems
and/or biologically-inspired robotic systems, wherein a convergent ACFRL
according to the preferred embodiment and similar to the illustrative wireless
transmission example is used to enhance control features of those systems.
Advantageously, the ACFRL algorithm of the preferred embodiment
consistently converges to a locally optimal policy. In this way, the fuzzy
rulebase is tuned such that system actions converge toward optimal ones.

Some advantages of the preferred embodiment include wide applicability
to fuzzy systems that become adaptive and learn from the environment, use of
reinforcement learning techniques which allow learning with delayed rewards,
and convergence to optimality. The advantageous combination of these three
features of the preferred embodiment when applied to control a system can
be very powerful. For example, many systems that currently benefit from use
of fuzzy control algorithms may be enhanced by including adaptive ACFRL
control according to the preferred embodiment set forth herein. Rule-based
and non rule-based fuzzy systems may each benefit by the preferred ACFRL
system.

FUZZY RULES FOR ACTION SELECTION 

As used herein, a fuzzy rulebase is a function $f$ that maps an input vector
$s$ in $\mathbb{R}^K$ into an output vector $a$ in $\mathbb{R}^m$.
Multi-input-single-output (MISO) fuzzy systems
$f : \mathbb{R}^K \to \mathbb{R}$ are preferred.

The preferred fuzzy rulebase includes a collection of fuzzy rules, although
a very simple fuzzy rulebase may include only a single fuzzy rule. A
fuzzy rule $i$ is a function $f_i$ that maps an input vector $s$ in
$\mathbb{R}^K$ into a scalar $a$ in $\mathbb{R}$. We will consider fuzzy rules
of the Takagi-Sugeno-Kang (TSK) form (Takagi and Sugeno, 1985; Sugeno and
Kang, 1988):

Rule $i$: IF $s_1$ is $S_1^i$ and $s_2$ is $S_2^i$ and ... and $s_K$ is $S_K^i$
THEN $a$ is $a^i = a_0^i + \sum_{j=1}^{K} a_j^i s_j$,

where $S_j^i$ are the input labels in rule $i$ and $a_j^i$ are tunable
coefficients. Each label is a membership function
$\mu : \mathbb{R} \to \mathbb{R}$ that maps its input into a degree to which
this input belongs to the fuzzy category (linguistic term) described by this
label.

A preferred fuzzy rulebase function $f(s)$ with $M$ rules can be written as:

$$f(s) = \frac{\sum_{i=1}^{M} a^i w^i(s)}{\sum_{i=1}^{M} w^i(s)}, \tag{13}$$

where $a^i$ is the output recommended by rule $i$ and $w^i(s)$ is the weight of
rule $i$. The preferred multi-output system may be decomposed into a collection
of single-output systems.

The input labels used in equation (13) have Gaussian form:

$$\mu_{S_j^i}(s_j) = b_j^i \exp\Big(-\frac{(s_j - \bar{s}_j^i)^2}{2(\sigma_j^i)^2}\Big). \tag{14}$$

The product inference is used for computing the weight of each rule:
$w^i(s) = \prod_{j=1}^{K} \mu_{S_j^i}(s_j)$. Wang (1992) proved that a fuzzy
rulebase with the above specifications can approximate any continuous function
on a compact input set arbitrarily well if the following parameters are allowed
to vary: $\bar{s}_j^i$, $\sigma_j^i$, $b_j^i$, and $a^i$. His result obviously
applies when $a^i = a_0^i + \sum_{j=1}^{K} a_j^i s_j$, in which case $a_j^i$
become the variable parameters. Making these substitutions into equation (13)
we get:

$$a = f(s) = \frac{\sum_{i=1}^{M} \big(a_0^i + \sum_{j=1}^{K} a_j^i s_j\big) \big(\prod_{j=1}^{K} b_j^i\big) \exp\Big(-\sum_{j=1}^{K} \frac{(s_j - \bar{s}_j^i)^2}{2(\sigma_j^i)^2}\Big)}{\sum_{i=1}^{M} \big(\prod_{j=1}^{K} b_j^i\big) \exp\Big(-\sum_{j=1}^{K} \frac{(s_j - \bar{s}_j^i)^2}{2(\sigma_j^i)^2}\Big)}. \tag{15}$$
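A TSK rulebase with Gaussian labels and product inference, as described by
equations (13) through (15), might be evaluated along the following lines. The
dictionary layout, the key names and the two example rules are assumptions
chosen for illustration only.

```python
import math

def rule_weight(s, centers, widths, heights):
    """Product inference: multiply the Gaussian label values of equation (14)."""
    w = 1.0
    for s_j, c, sig, b in zip(s, centers, widths, heights):
        w *= b * math.exp(-((s_j - c) ** 2) / (2.0 * sig ** 2))
    return w

def tsk_output(s, rules):
    """Weighted average of the linear rule outputs, equations (13) and (15)."""
    num = 0.0
    den = 0.0
    for rule in rules:
        w = rule_weight(s, rule["centers"], rule["widths"], rule["heights"])
        a_i = rule["a0"] + sum(a_j * s_j for a_j, s_j in zip(rule["a"], s))
        num += a_i * w
        den += w
    return num / den

# Hypothetical one-input rulebase: one rule centered at 0, one centered at 1.
rules = [
    {"centers": [0.0], "widths": [1.0], "heights": [1.0], "a0": 0.0, "a": [0.0]},
    {"centers": [1.0], "widths": [1.0], "heights": [1.0], "a0": 1.0, "a": [0.0]},
]
midpoint_output = tsk_output([0.5], rules)  # both rules fire equally here
```

At the midpoint between the two label centers the rule weights are equal, so
the output is the plain average of the two rule outputs.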

In preferred reinforcement learning applications, knowing the probability
of taking each available action allows for exploration of the action space. A
Gaussian probability distribution with a mean
$\bar{a}^i = a_0^i + \sum_{j=1}^{K} a_j^i s_j$ and a variance $\sigma^i$ is used
instead of $a^i$ in equation (15). That is, the probability of taking action
$a$ when the state $s$ is encountered, under the policy corresponding to
$\theta$ (the vector of all tunable parameters in the fuzzy rulebase) is given
by:

$$\pi_\theta(s, a) = \frac{\sum_{i=1}^{M} \exp\Big(-\frac{(\bar{a}^i - a)^2}{2(\sigma^i)^2}\Big) \big(\prod_{j=1}^{K} b_j^i\big) \exp\Big(-\sum_{j=1}^{K} \frac{(s_j - \bar{s}_j^i)^2}{2(\sigma_j^i)^2}\Big)}{\sum_{i=1}^{M} \big(\prod_{j=1}^{K} b_j^i\big) \exp\Big(-\sum_{j=1}^{K} \frac{(s_j - \bar{s}_j^i)^2}{2(\sigma_j^i)^2}\Big)}. \tag{16}$$

What follows is a convergence proof for the preferred ACFRL for the case
of a fuzzy rulebase actor specified in equation (16).
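The action-selection function of equation (16) can be sketched as a weighted
mixture of Gaussian bumps, one per rule. This sketch assumes the form
reconstructed above, in which each bump is centered on the rule's linear output
and is not normalized; the key names such as `sigma_a` are hypothetical.

```python
import math

def rule_weight(s, centers, widths, heights):
    """Product of Gaussian label values for one rule, as in equation (14)."""
    w = 1.0
    for s_j, c, sig, b in zip(s, centers, widths, heights):
        w *= b * math.exp(-((s_j - c) ** 2) / (2.0 * sig ** 2))
    return w

def policy_value(s, a, rules):
    """Evaluate pi_theta(s, a) as the weight-averaged Gaussian bump of each rule."""
    num = 0.0
    den = 0.0
    for rule in rules:
        w = rule_weight(s, rule["centers"], rule["widths"], rule["heights"])
        mean = rule["a0"] + sum(a_j * s_j for a_j, s_j in zip(rule["a"], s))
        bump = math.exp(-((mean - a) ** 2) / (2.0 * rule["sigma_a"] ** 2))
        num += bump * w
        den += w
    return num / den

# Hypothetical single-rule actor whose mean action is 0.
rules = [{"centers": [0.0], "widths": [1.0], "heights": [1.0],
          "a0": 0.0, "a": [0.0], "sigma_a": 1.0}]
at_mean = policy_value([0.3], 0.0, rules)
one_away = policy_value([0.3], 1.0, rules)
```

Every action receives a strictly positive value, which is the property that
Assumption A2 in the convergence argument relies on.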

CONVERGENCE OF FRL

Consider a Markov decision process with a finite state space $S$ and a finite
action space $A$. Let the actor be represented by a randomized stationary
policy (RSP) $\pi$, which is a mapping that assigns to each state $s \in S$ a
probability distribution over the action space $A$. Consider a set of RSPs
$P = \{\pi_\theta ; \theta \in \mathbb{R}^n\}$, parameterized in terms of a
vector $\theta$. For each pair $(s, a) \in S \times A$, $\pi_\theta(s, a)$
denotes the probability of taking action $a$ when the state $s$ is encountered,
under the policy corresponding to $\theta$.

The following restricted assumptions about the family of policies $P$ are
sufficient for convergence of the algorithm defined by equations (8)-(12):

• A1. For each $\theta \in \mathbb{R}^n$, the Markov chains $\{S_m\}$ of states
and $\{S_m, A_m\}$ of state-action pairs are irreducible and aperiodic, with
stationary distributions $\pi_\theta(s)$ and
$\eta_\theta(s, a) = \pi_\theta(s)\pi_\theta(s, a)$, respectively, under the
RSP $\pi_\theta$.

• A2. $\pi_\theta(s, a) > 0$ for all $\theta \in \mathbb{R}^n$, $s \in S$,
$a \in A$.

• A3. For all $s \in S$ and $a \in A$, the map
$\theta \to \pi_\theta(s, a)$ is twice differentiable.

• A4. Consider the function
$\psi_\theta(s, a) = \nabla \ln \pi_\theta(s, a) = \nabla \pi_\theta(s, a) / \pi_\theta(s, a)$,
which is well-defined and differentiable by A2 and A3. Then, for each
$\theta \in \mathbb{R}^n$, the $n \times n$ matrix $G(\theta)$ defined by

$$G(\theta) = \sum_{s, a} \eta_\theta(s, a) \psi_\theta(s, a) \psi_\theta(s, a)^T \tag{17}$$

needs to be uniformly positive definite. That is, there exists some
$\epsilon_1 > 0$ such that for all $r \in \mathbb{R}^n$ and
$\theta \in \mathbb{R}^n$,

$$r^T G(\theta) r \ge \epsilon_1 \|r\|^2. \tag{18}$$

The first assumption concerns the problem being solved, while the last
three assumptions concern the actor's architecture. In practice, the first
assumption usually holds because either all states communicate under
Assumption A2 or the learning is episodic and the system gets re-initialized at
the end of every episode.

Assumption A2 obviously holds for the fuzzy rulebase under consideration 
because the output is a mixture of Gaussian functions. 

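We will verify Assumption A3 directly, by differentiating the output
of the actor with respect to all parameters. Obtaining explicit expressions
for the derivatives of $\pi_\theta(s, a)$ will also help us in verifying
Assumption A4. Let

$$w^i = \Big(\prod_{j=1}^{K} b_j^i\Big) \exp\Big(-\sum_{j=1}^{K} \frac{(s_j - \bar{s}_j^i)^2}{2(\sigma_j^i)^2}\Big), \qquad g^i = \exp\Big(-\frac{(\bar{a}^i - a)^2}{2(\sigma^i)^2}\Big), \qquad H^i = \frac{w^i}{\sum_{m=1}^{M} w^m},$$

so that equation (16) reads $\pi_\theta(s, a) = \sum_{i=1}^{M} H^i g^i$ with
$\bar{a}^i = a_0^i + \sum_{j=1}^{K} a_j^i s_j$. Then, differentiating (16) with
respect to $a_j^i$ we get for $j = 0$:

$$\frac{\partial}{\partial a_0^i} \pi_\theta(s, a) = H^i g^i \frac{a - \bar{a}^i}{(\sigma^i)^2}, \tag{19}$$

and for $j = 1, \ldots, K$:

$$\frac{\partial}{\partial a_j^i} \pi_\theta(s, a) = H^i g^i \frac{a - \bar{a}^i}{(\sigma^i)^2} s_j, \tag{20}$$

which in both cases is a product of a polynomial and an exponential in $a_j^i$
and hence is differentiable once again.

Differentiating (16) with respect to the variance of the output action
distribution $\sigma^i$ we get:

$$\frac{\partial}{\partial \sigma^i} \pi_\theta(s, a) = H^i g^i \frac{(a - \bar{a}^i)^2}{(\sigma^i)^3}, \tag{21}$$

which is a fraction of polynomials and exponentials of polynomials in
$\sigma^i$ and hence is differentiable once again.

Differentiating (16) with respect to $b_j^i$ we get:

$$\frac{\partial}{\partial b_j^i} \pi_\theta(s, a) = \frac{H^i}{b_j^i} \big(g^i - \pi_\theta(s, a)\big), \tag{22}$$

which is a fraction of two polynomials in $b_j^i$ and hence is differentiable
once again.

Differentiating (16) with respect to the input label parameter $\sigma_j^i$ we
get:

$$\frac{\partial}{\partial \sigma_j^i} \pi_\theta(s, a) = H^i \big(g^i - \pi_\theta(s, a)\big) \frac{(s_j - \bar{s}_j^i)^2}{(\sigma_j^i)^3}, \tag{23}$$

which has only polynomial and exponential terms in $\sigma_j^i$ and hence is
differentiable once again.

Differentiating (16) with respect to the input label parameter $\bar{s}_j^i$ we
get:

$$\frac{\partial}{\partial \bar{s}_j^i} \pi_\theta(s, a) = H^i \big(g^i - \pi_\theta(s, a)\big) \frac{s_j - \bar{s}_j^i}{(\sigma_j^i)^2}, \tag{24}$$

which has only polynomial and exponential terms in $\bar{s}_j^i$ and hence is
differentiable once again. Equations (19)-(24) show that both first and second
derivatives of $\pi_\theta(s, a)$ exist, and hence Assumption A3 is verified.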

In order to verify Assumption A4, note that the functions $\psi_\theta^i$,
$i = 1, \ldots, n$, as defined in Assumption A4, can be computed by dividing the
derivatives in equations (19)-(24) by $\pi_\theta(s, a)$. Let us rewrite the
functions $\psi_\theta^i$, $i = 1, \ldots, n$, as vectors by evaluating them
sequentially at all the state-action pairs $(s, a)$. These vectors are linearly
independent because after dividing by $\pi_\theta(s, a)$, the derivatives in
equations (19)-(24) are nonlinear in their arguments and no function is a
constant multiple of another.

Rewriting the function $G(\theta)$ in the matrix form we get:

$$G(\theta) = M^T W M,$$

where $M$ is a $(|S| \times |A|)$-by-$n$ matrix with $\psi_\theta^i$ as columns
and $W$ is a diagonal matrix with $\eta_\theta$ on the main diagonal evaluated
sequentially at all state-action pairs. Since $\eta_\theta(s, a) > 0$ for every
$(s, a)$, we have for every vector $r \ne 0$,

$$r^T G(\theta) r = r^T M^T W M r = (Mr)^T W (Mr) > 0 \tag{25}$$

because linear independence of columns of $M$ implies $Mr \ne 0$.

Since $f(r, \theta) = \frac{r^T G(\theta) r}{\|r\|^2} = f(kr, \theta)$ for any
$k > 0$, it follows from (25) that

$$\inf_{r \ne 0} \frac{r^T G(\theta) r}{\|r\|^2} = \inf_{\|r\| = 1} \frac{r^T G(\theta) r}{\|r\|^2} = \epsilon_1(\theta) > 0, \tag{26}$$

because $r^T G(\theta) r$ is continuous in $r$ and thus achieves its minimum on
the compact set $S = \{r : \|r\| = 1\}$. Since inequality (18) is obviously
satisfied for any $\epsilon$ at $r = 0$, in the light of (26) it holds for all
$r$. That is, for any given $\theta \in \mathbb{R}^n$, there exists some
$\epsilon_1(\theta) > 0$ such that

$$r^T G(\theta) r \ge \epsilon_1(\theta) \|r\|^2 \text{ for all } r \in \mathbb{R}^n. \tag{27}$$

It remains to show that (27) holds uniformly for all $\theta$. Since the space
$\Theta$ of all $\theta$ admitted by our FRL algorithm is bounded because of the
truncation operator in equation (12), there exists $\theta^*$ in
$\bar{\Theta}$, the closure of $\Theta$, such that $\epsilon_1(\theta^*) > 0$ is
minimal by continuity of $\epsilon_1(\theta)$. Hence, there exists
$\epsilon_1 = \epsilon_1(\theta^*) > 0$ such that

$$r^T G(\theta) r \ge \epsilon_1 \|r\|^2$$

for all $\theta \in \Theta$ and for all $r$. Thus, the matrix $G(\theta)$ is
uniformly positive definite over the space of possible $\theta$ admitted by our
actor-critic algorithm.
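The role of linear independence in inequality (25) can be checked numerically
on a toy example. The matrices below are made-up numbers rather than score
vectors of an actual fuzzy actor; the sketch only evaluates the quadratic form
$r^T M^T W M r$ with a positive diagonal $W$.

```python
def quadratic_form(M, w, r):
    """Compute r^T M^T W M r, where W is the diagonal matrix built from w."""
    Mr = [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]
    return sum(w_k * v * v for w_k, v in zip(w, Mr))

w = [0.2, 0.3, 0.5]                 # positive weights, one per state-action pair

# Columns (1, 0, 1) and (0, 1, 1) are linearly independent.
M_independent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_pos = quadratic_form(M_independent, w, [1.0, -1.0])

# The second column is twice the first, so some r != 0 gives Mr = 0.
M_dependent = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
q_zero = quadratic_form(M_dependent, w, [2.0, -1.0])
```

With dependent columns the form vanishes for a nonzero $r$, so the matrix is
only positive semidefinite; this is why the independence argument is needed.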

We have just verified that the fuzzy actor satisfies all the assumptions
necessary for convergence. Therefore, the learning process in the actor-critic
based FRL algorithm according to the preferred embodiment converges to
a locally optimal value of the parameter vector $\theta$.

ILLUSTRATIVE METHOD

Figures 1 and 2 schematically show, in flow diagram format, an illustrative
method according to a preferred embodiment. Referring to Figure 1,
parameter $t$ is set to $t = 0$ at step S1. Then the critic's parameter $\rho$
and vectors $z$ and $p$ are set to 0 at step S2. At step S3, the actor's
parameters are initialized: $\bar{s}_j^i = 0$, $\sigma_j^i = 1$, $b_j^i = 1$,
$a_j^i = 0$, $a_0^i = 0$, $\sigma^i = 1$ for $i = 1 \ldots M$ and
$j = 1 \ldots K$. These parameters are also arranged into a single vector
quantity $\theta$ at step S3. The system state at $t = 0$ is denoted as $s_0$
at step S4. At step S5, the actor chooses an action $a_0$ based on its current
parameters $\theta_0$ and the state $s_0$. The probability of choosing each
possible action is given as $P(a_0 = a) = \pi_{\theta_0}(s_0, a)$.

The method then goes to step S6, where the action $a_t$ is implemented.
At step S7, the system moves to the next state $s_{t+1}$ and the reward $r_t$
is observed. Next, at step S8, a new action $a_{t+1}$ based on $\theta_t$ and
$s_{t+1}$ is chosen, wherein $P(a_{t+1} = a) = \pi_{\theta_t}(s_{t+1}, a)$.

In step S9, the actor's output $P(a_t = a \mid s_t = s)$ is denoted by
$f(a_t, s_t)$ and $P(a_{t+1} = a \mid s_{t+1} = s)$ is denoted by
$f(a_{t+1}, s_{t+1})$. Then partial derivatives of $f(a_t, s_t)$ and
$f(a_{t+1}, s_{t+1})$ are computed with respect to all parameters comprising
the vector $\theta_t$ according to formulas (19)-(24).

Then at step S10 of Figure 2, the partial derivatives of $f(a_t, s_t)$ and
$f(a_{t+1}, s_{t+1})$ are arranged into vectors $X^t$ and $X^{t+1}$, each being
of length $n$.



23 



EL858979614US 



At step S11,

$$Q_{p_t}^{\theta_t}(s_t, a_t) = \sum_{i=1}^{n} p_t^i X_i^t \tag{28}$$

and

$$Q_{p_t}^{\theta_t}(s_{t+1}, a_{t+1}) = \sum_{i=1}^{n} p_t^i X_i^{t+1} \tag{29}$$

are computed.

At step S12, the parameter $\rho$ is updated according to equation (9) as
$\rho_{t+1} = \rho_t + \alpha_t (r_t - \rho_t)$. At step S13, the vector $p$ is
updated according to equation (10) as
$p_{t+1} = p_t + \alpha_t \big(r_t - \rho_t + Q_{p_t}^{\theta_t}(s_{t+1}, a_{t+1}) - Q_{p_t}^{\theta_t}(s_t, a_t)\big) z_t$.

In steps S14 through S16, the vector $z$ is updated according to equation
(11). That is, the TD(1) critic updates $z_t$ according to:

$$z_{t+1} = \begin{cases} z_t + \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1}), & s_t \ne s_0 \text{ or } a_t \ne a_0, \\ \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1}), & \text{otherwise,} \end{cases}$$

while the TD($\lambda$) critic updates $z_t$ according to:

$$z_{t+1} = \lambda z_t + \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1}). \tag{30}$$

At step S17, vector $\theta$ is updated according to equation (12) as:

$$\theta_{t+1} = \theta_t - \beta_t Q_{p_t}^{\theta_t}(s_{t+1}, a_{t+1}) \nabla \ln \pi_{\theta_t}(s_{t+1}, a_{t+1}). \tag{31}$$

In steps S18 and S19, if the absolute value of any component of the vector
$\theta_{t+1}$ is determined to be greater than some large constant $M_i$, it is
truncated to be equal to $M_i$.

Then the method proceeds to step S20 of Figure 1, wherein $t$ is set to
$t + 1$.
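Steps S11 through S19 can be collected into a single update function, sketched
below for the TD($\lambda$) variant. The sketch assumes that `x_t` and `x_next`
already hold the score vectors $\nabla \ln \pi_{\theta_t}$ at $(s_t, a_t)$ and
$(s_{t+1}, a_{t+1})$, that one common bound is used for every component, and
that truncation keeps the sign of the component; the function and argument
names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def acfrl_iteration(p, rho, z, theta, x_t, x_next, r, alpha, beta, lam, bound):
    """One pass through steps S11-S19; returns the updated (p, rho, z, theta)."""
    q_t = dot(p, x_t)            # S11, equation (28)
    q_next = dot(p, x_next)      # S11, equation (29)
    td = r - rho + q_next - q_t  # bracketed term of equation (10)
    new_rho = rho + alpha * (r - rho)                          # S12, equation (9)
    new_p = [p_i + alpha * td * z_i for p_i, z_i in zip(p, z)] # S13, equation (10)
    new_z = [lam * z_i + x_i for z_i, x_i in zip(z, x_next)]   # S14-S16, equation (30)
    new_theta = [th - beta * q_next * x_i                      # S17, equation (31)
                 for th, x_i in zip(theta, x_next)]
    # S18-S19: truncate components whose magnitude exceeds the bound.
    new_theta = [max(-bound, min(bound, th)) for th in new_theta]
    return new_p, new_rho, new_z, new_theta

# Hypothetical two-parameter example.
p1, rho1, z1, theta1 = acfrl_iteration(
    p=[1.0, 0.0], rho=0.0, z=[1.0, 0.0], theta=[0.0, 0.0],
    x_t=[1.0, 0.0], x_next=[0.5, 1.0], r=2.0,
    alpha=0.1, beta=0.2, lam=0.5, bound=10.0)
```

All right-hand sides use the values from time $t$, so the order in which the
four quantities are written back does not matter in this sketch.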
WIRELESS COMMUNICATION

What follows is a description of an application of the preferred ACFRL 
method to a practical wireless communication problem. As mentioned above, 
similar application can be described with respect to many other types of 
systems. 

DOMAIN DESCRIPTION 

With the introduction of the IS-95 Code-Division Multiple Access (CDMA)
standard (TIA/EIA/IS-95), the use of spread-spectrum as a multiple access
technique in commercial wireless systems is growing rapidly in popularity.
Unlike more traditional methods such as time-division multiple access
(TDMA) or frequency-division multiple access (FDMA), the entire
transmission bandwidth is shared between all users at all times. This allows
more users to be included in a channel at the expense of a gradual
deterioration in their quality of service due to mutual interference.

In order to improve the efficiency of sharing the common bandwidth
resource in this system, special power control algorithms are required.
Controlling the transmitter powers in wireless communication networks provides
multiple benefits. It allows interfering links sharing the same radio channel to
achieve required quality of service (QoS) levels, while minimizing the power
spent in the process and extending the battery life of mobile users.
Moreover, judicious use of power reduces interference and increases the network
capacity.

Most of the research in this area, however, has concentrated on voice-oriented
"continuous traffic," which is dominant in the current generation of wireless
networks. Next generation wireless networks are currently being designed
to support intermittent packetized data traffic, beyond the standard
voice-oriented continuous traffic. For example, web browsing on a mobile laptop
computer will require such services. The problem of power control in this
new environment is not well understood, since it differs significantly from the
one in the voice traffic environment.

Data traffic is less sensitive to delays than voice traffic, but it is more
sensitive to transmission errors. Reliability can be assured via retransmissions,
which cannot be used in continuous voice traffic domains. Therefore, the
delay tolerance of data traffic can be exploited to design efficient transmission
algorithms that adapt the power level to the current interference level in
the channel and to transmitter-dependent factors such as the backlog level.

PROBLEM FORMULATION 

The power control setup of Bambos and Kandukuri (2000) is preferred
for testing the performance of the ACFRL algorithm. The transmitter is
modeled as a finite-buffer queue, to which data packets arrive in a Poisson
manner. When a packet arrives at a full buffer, it is dropped and a cost L
is incurred. The interference is uniformly distributed. At every time step k
the agent observes the current interference i(k) and backlog b(k) and chooses a
power level p(k). The cost C(k) incurred by a wireless transmitter (agent) is
a weighted sum of the backlog b(k) and the power p(k) used for transmission:

$$C(k) = \alpha p(k) + b(k). \tag{32}$$

The probability s of successful transmission for a power level p and
interference i is:

$$s(p, i) = 1 - \exp\left(-\frac{p}{\delta i}\right), \tag{33}$$

where $\delta > 0$, with higher values indicating a higher level of transmission noise.
If transmission is not successful, the packet remains at the head of the queue. 
The agent's objective is to minimize the average cost per step over the length 
of the simulation. 
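
The cost (32) and the success probability (33) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the exponent in (33) is assumed to have the form p/(δ i), and the handling of zero power or zero interference is an assumption that is not taken from the text.

```python
import math


def step_cost(power, backlog, alpha=1.0):
    """Per-step cost of equation (32): C = alpha * p + b."""
    return alpha * power + backlog


def success_probability(power, interference, delta=1.0):
    """Success probability of equation (33), assumed to be 1 - exp(-p / (delta * i)).

    Edge cases are assumptions made for this sketch: no power means no
    success, and zero interference with positive power always succeeds.
    """
    if power <= 0.0:
        return 0.0
    if interference <= 0.0:
        return 1.0
    return 1.0 - math.exp(-power / (delta * interference))
```

Under this form, raising the power increases the chance of clearing the packet at the head of the queue while also raising the immediate cost, and a larger δ lowers the success probability for the same power and interference.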

In this problem, the agent faces the following dilemma when deciding 
on its power level: higher power implies a greater immediate cost but a 
smaller future cost due to reduction in the backlog. The optimal strategy 
here depends on several variables, such as buffer size, overflow cost, arrival 
rate, and dynamics of the interference. 

SOLUTION METHODOLOGY 

Analytical investigations of this problem formulation have been performed
by Bambos and Kandukuri (2000). They have derived an optimal policy for
a single wireless transmitter when interference is random and either follows a
uniform distribution or has a Markovian structure. Their strategy assumes
that interference and backlog are used as input variables. When the agent
observes high interference in the channel, it recognizes that it will have
to spend a lot of power to overcome the interference and transmit a packet
successfully. Therefore, the agent backs off, buffers the incoming data packets
and waits for the interference to subside. The exact level of interference at
which the agent goes into the backoff mode depends on the agent's backlog. When
the backlog is high and the buffer is likely to overflow, it is recognized herein
that the agent should be more aggressive than when the backlog is low.

The distribution of interference is not known a priori, so the analytical
solution of Bambos and Kandukuri cannot be applied. Instead, a simulation-based
algorithm has to be used for improving the agent's behavior. Another
significant advantage of simulation-based algorithms is that they can be applied
to much more complex formulations than the one considered above,
such as the case of simultaneous learning by multiple transmitters.

Unfortunately, conventional reinforcement learning algorithms cannot be
applied to this problem because interference is a real-valued input and power
is a real-valued output. State space generalization is required for dealing
with real-valued inputs. An ACFRL algorithm according to a preferred embodiment
may, however, be used to tackle this problem.

As discussed previously, Bambos and Kandukuri (2000) have shown that
the optimal power function is hump-shaped with respect to interference, with
the height as well as the center of the hump steadily increasing with backlog.
Therefore, the following rules used in the ACFRL actor have sufficient
expressive power to match the complexity of the optimal policy:

If (backlog is SMALL) and (interference is SMALL) then (power is p1)
If (backlog is SMALL) and (interference is MEDIUM) then (power is p2)
If (backlog is SMALL) and (interference is LARGE) then (power is p3)
If (backlog is LARGE) and (interference is SMALL) then (power is p4)
If (backlog is LARGE) and (interference is MEDIUM) then (power is p5)
If (backlog is LARGE) and (interference is LARGE) then (power is p6),
where p1 through p6 are the tunable parameters. The shapes of the
backlog and interference labels are shown in Figures 3 and 4. The final
power is drawn from a Gaussian distribution, which has as its center the
conclusion of the above rulebase and has a fixed variance $\sigma$.
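
A minimal sketch of how such a six-rule actor could be evaluated is given below. Several details are assumptions, since Figures 3 and 4 are not reproduced here: the label breakpoints (backlog labels spanning 0 to a buffer size of 20, interference labels spanning 0 to 100 with MEDIUM peaking at 50), the use of product inference with a weighted average of the rule conclusions, and the treatment of the spread parameter as a standard deviation. All function names are hypothetical.

```python
import random


def falling(x, lo, hi):
    """Shoulder label: 1 at or below lo, 0 at or above hi, linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def rising(x, lo, hi):
    """Mirror image of `falling`."""
    return 1.0 - falling(x, lo, hi)


def triangle(x, lo, peak, hi):
    """Triangular label that is 0 outside (lo, hi) and 1 at peak."""
    if x <= lo or x >= hi:
        return 0.0
    if x <= peak:
        return (x - lo) / (peak - lo)
    return (hi - x) / (hi - peak)


def rule_strengths(backlog, interference, buffer_size=20.0, max_interference=100.0):
    """Firing strengths of the six rules, in the order p1..p6 (product inference)."""
    mid = max_interference / 2.0
    b_labels = (falling(backlog, 0.0, buffer_size),      # SMALL
                rising(backlog, 0.0, buffer_size))       # LARGE
    i_labels = (falling(interference, 0.0, mid),         # SMALL
                triangle(interference, 0.0, mid, max_interference),  # MEDIUM
                rising(interference, mid, max_interference))         # LARGE
    return [b * i for b in b_labels for i in i_labels]


def rulebase_power(params, backlog, interference):
    """Weighted average of the rule conclusions p1..p6."""
    weights = rule_strengths(backlog, interference)
    return sum(w * p for w, p in zip(weights, params)) / sum(weights)


def sample_power(params, backlog, interference, sigma=1.0, rng=random):
    """Draw the final power from a Gaussian centered at the rulebase output."""
    return rng.gauss(rulebase_power(params, backlog, interference), sigma)
```

With these assumed labels the firing strengths always sum to one, so the rulebase output is a convex combination of p1 through p6.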

We chose to tune only a subset of all parameters in the above rulebase
because our goal in these experiments was to demonstrate the convergence
property of ACFRL rather than the expressive capability of fuzzy logic for
power control. The six chosen parameters have the greatest effect on the
rulebase output and are the most difficult ones to estimate from prior
knowledge.

Since we were not tuning the input label parameters, the membership
functions did not need to be Gaussian; Gaussian functions were used in our
proof for their differentiability property. Instead, we used triangular and
trapezoidal labels for simplicity of implementation. Figures 3 and 4 illustrate
how backlog and interference fuzzy labels, respectively, are used by the agents.

A difficulty in the considered power control problem is that the benefit of
using a higher or a lower power becomes apparent only after a long delay. Because
both arrivals and transmissions are stochastic, many traces are needed in order
to distinguish the true value of a policy from random effects.

In order to apply an ACFRL algorithm according to a preferred embodiment
to this challenging problem, we made the actor's exploration more systematic
and separated the updates to the average cost per step, the critic's parameters
and the actor's parameters into distinct phases. During the first phase, the
algorithm runs 20 simulation traces of 500 steps each, keeping both the actor
and the critic fixed, to estimate the average cost per step of the actor's policy.
Each trace starts with the same initial backlog. In the second phase
only the critic is learning, based on the average cost p obtained in the previous
phase. This phase consists of 20 traces during which the actor always uses
a power that is one unit higher than the recommendation of its rulebase
and 20 traces during which the actor's power is one unit lower. As opposed
to the probabilistic exploration at every time step suggested by Konda and
Tsitsiklis, systematic exploration is very beneficial in problems with delayed
rewards, as it allows the critic to observe more clearly the connection between
a certain direction of exploration and the outcome. Finally, in the
third phase, the algorithm runs 20 traces during which the critic is fixed and
the actor is learning.

PERIOD   POLICY                                 AVE COST   STDEV
0        (1, 1, 1, 20, 20, 20)                  31.6       0.06
10       (4.5, 16.9, 4.5, 23.1, 33.9, 23.0)     26.3       0.08
10       (4.0, 14.4, 3.8, 22.9, 33.1, 22.8)     26.4       0.08
100      (4.3, 16.1, 4.4, 23.3, 35.3, 23.4)     26.1       0.08
100      (4.6, 16.7, 4.4, 23.6, 36.1, 23.5)     26.0       0.08

Table 1: Actor's performance during two independent training processes for
uniform interference on [0,100].
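
The three-phase period described above can be laid out as an explicit schedule of traces. The sketch below only enumerates which traces run in which phase; the names `TraceSpec` and `period_schedule` are hypothetical, and the zero offset used in the third phase is a placeholder, since the text does not state what exploration is applied while the actor is learning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSpec:
    phase: int           # 1, 2 or 3
    power_offset: float  # added to the rulebase recommendation
    learner: str         # "none", "critic" or "actor"


def period_schedule(traces_per_block=20):
    """Return the ordered list of traces that make up one training period."""
    schedule = []
    # Phase 1: actor and critic fixed; estimate the average cost per step.
    schedule += [TraceSpec(1, 0.0, "none")] * traces_per_block
    # Phase 2: only the critic learns, with power shifted one unit up,
    # then one unit down, relative to the rulebase recommendation.
    schedule += [TraceSpec(2, +1.0, "critic")] * traces_per_block
    schedule += [TraceSpec(2, -1.0, "critic")] * traces_per_block
    # Phase 3: critic fixed, actor learns (offset 0.0 is an assumption).
    schedule += [TraceSpec(3, 0.0, "actor")] * traces_per_block
    return schedule
```

Under these assumptions one period consists of 80 traces, of which 40 are used to train the critic.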



We have simulated the wireless power control problem with the following 
parameters: 

* Arrival Rate = 0.4 

* Initial Backlog = 10 

* Buffer Size = 20 

* Overflow Cost L = 50 

* Power Cost Factor $\alpha$ = 1

* Transmission Noise $\delta$ = 1
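
To make the role of these parameters concrete, the following sketch simulates a single trace of the queue under a fixed policy and returns the average cost per step. It is an illustration under stated assumptions rather than the simulator used for the reported experiments: the order of events within a step (observe, pay cost, transmit, then receive arrivals), the single transmission attempt per step, the success probability 1 - exp(-p/(δ i)), and the function names are all assumptions.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def average_cost_per_step(policy, steps=500, seed=0, arrival_rate=0.4,
                          initial_backlog=10, buffer_size=20,
                          overflow_cost=50.0, alpha=1.0, delta=1.0):
    """Simulate one trace and return its average cost per step.

    `policy(backlog, interference)` returns the transmit power.
    """
    rng = random.Random(seed)
    backlog = initial_backlog
    total = 0.0
    for _ in range(steps):
        interference = rng.uniform(0.0, 100.0)  # uniform interference on [0, 100]
        power = max(0.0, policy(backlog, interference))
        total += alpha * power + backlog        # per-step cost
        # One transmission attempt for the packet at the head of the queue.
        if backlog > 0 and power > 0.0:
            if interference <= 0.0:
                p_success = 1.0
            else:
                p_success = 1.0 - math.exp(-power / (delta * interference))
            if rng.random() < p_success:
                backlog -= 1
        # New arrivals; each packet that finds the buffer full is dropped.
        for _ in range(poisson_sample(arrival_rate, rng)):
            if backlog >= buffer_size:
                total += overflow_cost
            else:
                backlog += 1
    return total / steps
```

For instance, `average_cost_per_step(lambda b, i: 5.0)` evaluates a constant-power policy; averaging the result over many seeds gives an estimate of the kind used in the first phase.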

We found that the ACFRL algorithm consistently converged to the same
neighborhood for all six power parameters $p_i$ for a given initial condition.
Each period in the experiments below consisted of all three phases.

RESULTS

Table 1 shows the results of two independent runs of the ACFRL algorithm
for uniform interference. Both runs started with the same initial policy
(p1, p2, p3, p4, p5, p6) = (1, 1, 1, 20, 20, 20). Table 2 shows results of the same
setup for Gaussian interference. The average cost of each policy is obtained
by separately running it for 100 periods with 2000 traces total. Notice that
the average cost of the policies obtained after 100 periods is significantly lower
than the cost of the initial policy. Also, these results show that the ACFRL
algorithm converges very quickly (on the order of 10 periods) to a locally
optimal policy, and keeps the parameters there if the learning continues.

PERIOD   POLICY                                 AVE COST   STDEV
0        (1, 1, 1, 20, 20, 20)                  38.9       0.06
10       (2.7, 17.7, 2.8, 21.7, 37.2, 21.9)     30.4       0.09
10       (2.7, 17.5, 2.8, 21.6, 35.7, 21.7)     30.5       0.09
100      (2.6, 17.3, 2.7, 22.0, 40.2, 22.2)     30.2       0.09
100      (2.8, 18.0, 2.8, 22.0, 39.5, 22.1)     30.1       0.08

Table 2: Actor's performance during two independent training processes for
Gaussian interference with mean 50 and standard deviation 20, clipped at 0.
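
As a rough check of the improvement reported in Tables 1 and 2, the relative reduction in average cost between the initial policy and the lowest-cost policy after 100 periods can be worked out from the tabulated values. This arithmetic is only illustrative and assumes that the cost entries of each table pair with the periods in the order printed.

```latex
\frac{31.6 - 26.0}{31.6} \approx 0.177 \quad \text{(uniform interference)},
\qquad
\frac{38.9 - 30.1}{38.9} \approx 0.226 \quad \text{(Gaussian interference)}.
```

In both cases the reduction is many times larger than the reported standard deviations of 0.06 to 0.09.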

Notice that for the uniform interference, the shape of the resulting policy
is the same as the one suggested by Bambos and Kandukuri (2000). That
is, for a given level of backlog, the optimal power first increases with
interference and then decreases. Also, as the backlog increases, the optimal
power steadily increases for a given level of interference. The optimal policy
for the case of Gaussian interference cannot be found analytically, because
the Normal distribution function is not invertible in closed form and Bambos
and Kandukuri relied on inverting the distribution function in their
calculations. However, the ACFRL algorithm handles this case just as well,
and the resulting optimal policy once again has the expected form.

Above, an analytical foundation for the earlier work in Fuzzy Reinforcement
Learning (FRL) conducted by ourselves and other researchers has been
provided. Using the actor-critic approach to reinforcement learning of
Konda and Tsitsiklis (2000), a convergence proof for FRL has been postulated
and then derived, where the actor is a fuzzy rulebase with TSK rules,
Gaussian membership functions and product inference.

The performance of the ACFRL algorithm has been tested on a challenging
problem of power control in wireless transmitters. The original actor-critic
framework of Konda and Tsitsiklis (2000) is not adequate for dealing
with the high degree of stochasticity and delayed rewards present in the
power control domain. However, after separating the updates to the average
cost per step, the critic's parameters, and the actor's parameters into distinct
phases, the ACFRL algorithm of the preferred embodiment has shown consistent
convergence results to a locally optimal policy.

While exemplary drawings and specific embodiments of the present invention
have been described and illustrated, it is to be understood that the
scope of the present invention is not to be limited to the particular embodiments
discussed. Thus, the embodiments shall be regarded as illustrative
rather than restrictive, and it should be understood that variations may be
made in those embodiments by workers skilled in the arts without departing
from the scope of the present invention as set forth in the claims that follow,
and equivalents thereof.

In addition, in the method claims that follow, the operations have been 
ordered in selected typographical sequences. However, the sequences have 
been selected and so ordered for typographical convenience and are not in- 
tended to imply any particular order for performing the operations, except 
for those claims wherein a particular ordering of steps is expressly set forth 
or understood by one of ordinary skill in the art as being necessary. 